Contingency Matrix (sklearn.metrics.cluster.contingency_matrix)#

The contingency matrix is a count table that summarizes how two labelings overlap. In clustering evaluation it is commonly built from:

  • labels_true: ground-truth classes (if available)

  • labels_pred: predicted cluster IDs (arbitrary up to permutation)

It is the fundamental object behind many external clustering scores (purity, mutual information, Rand index, …).


Quick import#

from sklearn.metrics.cluster import contingency_matrix

Learning goals#

  • Define the contingency matrix with clear math notation

  • Implement it from scratch in NumPy

  • Visualize and interpret it (raw counts + normalized views)

  • Use it to align cluster IDs to classes and compute simple derived scores (purity, “best-mapped” accuracy)

  • Use it for a simple hyperparameter search (choose k for k-means) when ground truth exists

import itertools

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from scipy import sparse as sp
from scipy.optimize import linear_sum_assignment

from sklearn.metrics.cluster import contingency_matrix

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Definition#

Let there be \(n\) items. For each item \(i\) we have two labels:

  • a reference label (often ground truth) \(y_i\)

  • a predicted label (often a cluster ID) \(z_i\)

Assume the unique values in \(y\) are \(\mathcal{C} = \{c_1,\dots,c_C\}\) and in \(z\) are \(\mathcal{K} = \{k_1,\dots,k_K\}\).

The contingency matrix \(N \in \mathbb{Z}_{\ge 0}^{C \times K}\) is defined by:

\[ N_{a,b} = \sum_{i=1}^n \mathbf{1}\{y_i=c_a \ \wedge \ z_i=k_b\} \]

Interpretation:

  • Row \(a\) describes how true class \(c_a\) is split across clusters.

  • Column \(b\) describes which true classes appear inside cluster \(k_b\).

Useful marginals:

\[ n_a = \sum_{b=1}^K N_{a,b} \quad (\text{size of class } c_a), \qquad m_b = \sum_{a=1}^C N_{a,b} \quad (\text{size of cluster } k_b) \]

and \(\sum_{a,b} N_{a,b} = n\).
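As a quick check of these identities, here is a minimal sketch using a made-up 2×2 count table:

```python
import numpy as np

# Made-up contingency table: 2 classes (rows) x 2 clusters (columns)
N = np.array([[3, 1],
              [0, 2]])

class_sizes = N.sum(axis=1)    # n_a: row marginals (class sizes)
cluster_sizes = N.sum(axis=0)  # m_b: column marginals (cluster sizes)

print(class_sizes)    # [4 2]
print(cluster_sizes)  # [3 3]
print(N.sum())        # 6, the total number of items n
```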

Probabilistic view#

Dividing by \(n\) gives the empirical joint distribution:

\[ P_{a,b} = \frac{N_{a,b}}{n} \]

Many information-theoretic clustering metrics (e.g. mutual information) are computed from \(P\) (plus its row/column marginals).
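As a sketch of that connection, mutual information (in nats) can be computed directly from \(P\) and its marginals; the 3×3 table below is a small made-up example:

```python
import numpy as np

# Made-up contingency table: 3 classes (rows) x 3 clusters (columns)
N = np.array([[0, 2, 0],
              [2, 0, 1],
              [0, 0, 2]], dtype=float)

P = N / N.sum()                    # empirical joint P_{a,b}
p = P.sum(axis=1, keepdims=True)   # row marginal, ~P(y = c_a)
q = P.sum(axis=0, keepdims=True)   # column marginal, ~P(z = k_b)

mask = P > 0                       # skip zero cells (0 * log 0 := 0)
mi = float(np.sum(P[mask] * np.log(P[mask] / (p @ q)[mask])))
print(round(mi, 4))  # 0.8062
```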

eps in scikit-learn#

sklearn.metrics.cluster.contingency_matrix has an optional eps argument that adds a constant to every entry:

\[ N \leftarrow N + \varepsilon \]

This is mainly a numerical trick to avoid log(0) / NaN propagation in downstream computations. It changes the counts, so use it intentionally.
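A minimal illustration of the eps behavior on a tiny made-up labeling (note that passing eps makes the returned matrix floating-point):

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

y = ["a", "a", "b"]
z = [0, 1, 1]

cm = contingency_matrix(y, z)
cm_eps = contingency_matrix(y, z, eps=1e-10)

print(cm)            # integer counts, may contain exact zeros
print(cm_eps.dtype)  # float64: every entry has been shifted by eps
```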

2) NumPy implementation (dense, plus an optional sparse variant)#

Below is a small from-scratch implementation.

Notes:

  • scikit-learn returns only the matrix; here we also return the unique label values so we can label axes in plots.

  • sparse=True returns a CSR matrix (useful when \(C\times K\) is huge but only a few entries are non-zero).

def contingency_matrix_np(labels_true, labels_pred, *, eps=None, sparse=False, dtype=np.int64):
    """Build a contingency matrix N where N[a,b] = #{i : y_i=c_a and z_i=k_b}.

    Returns
    -------
    cm : ndarray or scipy.sparse.csr_matrix of shape (n_true, n_pred)
    true_labels : ndarray of shape (n_true,)
        Sorted unique values from labels_true (np.unique ordering).
    pred_labels : ndarray of shape (n_pred,)
        Sorted unique values from labels_pred (np.unique ordering).
    """
    y = np.asarray(labels_true)
    z = np.asarray(labels_pred)

    if y.shape != z.shape:
        raise ValueError(f"labels_true and labels_pred must have the same shape; got {y.shape} vs {z.shape}")

    y = y.ravel()
    z = z.ravel()

    true_labels, y_inv = np.unique(y, return_inverse=True)
    pred_labels, z_inv = np.unique(z, return_inverse=True)

    n_true = true_labels.size
    n_pred = pred_labels.size

    if sparse:
        if eps is not None:
            raise ValueError("eps must be None when sparse=True (matches scikit-learn behavior)")
        data = np.ones_like(y_inv, dtype=dtype)
        cm = sp.coo_matrix((data, (y_inv, z_inv)), shape=(n_true, n_pred), dtype=dtype).tocsr()
        return cm, true_labels, pred_labels

    cm = np.zeros((n_true, n_pred), dtype=dtype)
    np.add.at(cm, (y_inv, z_inv), 1)

    if eps is not None:
        cm = cm.astype(np.float64, copy=False) + float(eps)

    return cm, true_labels, pred_labels
labels_true = np.array(["cat", "cat", "dog", "dog", "dog", "fish", "fish"])
labels_pred = np.array([1, 1, 0, 0, 2, 2, 2])

cm_np, y_vals, z_vals = contingency_matrix_np(labels_true, labels_pred)
cm_sk = contingency_matrix(labels_true, labels_pred)

print("true label values:", y_vals)
print("pred label values:", z_vals)
print("\ncontingency (NumPy):\n", cm_np)
print("\ncontingency (sklearn):\n", cm_sk)
print("\nallclose:", np.allclose(cm_np, cm_sk))

cm_sparse, _, _ = contingency_matrix_np(labels_true, labels_pred, sparse=True)
print("\nsparse (csr) -> dense:\n", cm_sparse.toarray())
true label values: ['cat' 'dog' 'fish']
pred label values: [0 1 2]

contingency (NumPy):
 [[0 2 0]
 [2 0 1]
 [0 0 2]]

contingency (sklearn):
 [[0 2 0]
 [2 0 1]
 [0 0 2]]

allclose: True

sparse (csr) -> dense:
 [[0 2 0]
 [2 0 1]
 [0 0 2]]
fig = px.imshow(
    cm_np,
    text_auto=True,
    aspect="auto",
    x=[str(v) for v in z_vals],
    y=[str(v) for v in y_vals],
    labels={"x": "predicted label / cluster", "y": "true label", "color": "count"},
    title="Contingency matrix (raw counts)",
)
fig.update_layout(height=350)
fig.show()

3) Normalized views (row-wise vs column-wise)#

Raw counts are great for interpretability, but they scale with dataset size. Two common normalizations turn the table into conditional probabilities:

  • Row-normalized: \(N_{a,b} / n_a \approx P(z=k_b \mid y=c_a)\)

  • Column-normalized: \(N_{a,b} / m_b \approx P(y=c_a \mid z=k_b)\)

Row-normalization answers: “Given a true class, how does it get distributed over clusters?” Column-normalization answers: “Given a predicted cluster, what’s its composition?”

row_sums = cm_np.sum(axis=1, keepdims=True)
col_sums = cm_np.sum(axis=0, keepdims=True)

row_norm = np.divide(cm_np, row_sums, out=np.zeros(cm_np.shape, dtype=float), where=row_sums > 0)
col_norm = np.divide(cm_np, col_sums, out=np.zeros(cm_np.shape, dtype=float), where=col_sums > 0)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Row-normalized: P(cluster | class)", "Column-normalized: P(class | cluster)"),
)

fig.add_trace(
    go.Heatmap(
        z=row_norm,
        x=[str(v) for v in z_vals],
        y=[str(v) for v in y_vals],
        zmin=0,
        zmax=1,
        colorbar=dict(title="prob"),
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Heatmap(
        z=col_norm,
        x=[str(v) for v in z_vals],
        y=[str(v) for v in y_vals],
        zmin=0,
        zmax=1,
        showscale=False,
    ),
    row=1,
    col=2,
)

fig.update_layout(height=360, title="Same contingency table, different questions")
fig.show()

4) Label permutation (why clusters are different from classes)#

Cluster IDs carry no semantic meaning: which cluster is called 0 and which is called 1 is arbitrary. If we relabel clusters, the columns of the contingency matrix simply permute.

This is why many clustering scores are label-permutation invariant. If you compare contingency matrices directly, make sure you’re comparing them in a consistent column order.
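To see the invariance concretely, a permutation-invariant score such as the adjusted Rand index (adjusted_rand_score from sklearn.metrics) does not change when cluster IDs are relabeled; this sketch reuses the toy labels from section 2:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

labels_true = np.array(["cat", "cat", "dog", "dog", "dog", "fish", "fish"])
labels_pred = np.array([1, 1, 0, 0, 2, 2, 2])

perm = np.array([2, 0, 1])           # relabel clusters: 0->2, 1->0, 2->1
labels_perm = perm[labels_pred]

ari_orig = adjusted_rand_score(labels_true, labels_pred)
ari_perm = adjusted_rand_score(labels_true, labels_perm)
print(np.isclose(ari_orig, ari_perm))  # True: ARI ignores cluster naming
```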

# Permute the predicted labels (swap the meaning of cluster IDs)

# Here z_vals are [0, 1, 2], so we can permute by an index mapping.
perm = np.array([2, 0, 1])  # 0->2, 1->0, 2->1
labels_pred_perm = perm[labels_pred]

cm_perm, _, z_vals_perm = contingency_matrix_np(labels_true, labels_pred_perm)
inv_perm = np.argsort(perm)  # new->old

print("perm (old->new):", perm)
print("inv_perm (new->old):", inv_perm)
print("original pred label values:", z_vals)
print("permuted pred label values:", z_vals_perm)
print("cm_perm == cm_np[:, inv_perm] ?", np.allclose(cm_perm, cm_np[:, inv_perm]))

fig = make_subplots(rows=1, cols=2, subplot_titles=("Original", "After permuting cluster IDs"))

fig.add_trace(
    go.Heatmap(z=cm_np, x=[str(v) for v in z_vals], y=[str(v) for v in y_vals], text=cm_np, texttemplate="%{text}"),
    row=1,
    col=1,
)
fig.add_trace(
    go.Heatmap(
        z=cm_perm,
        x=[str(v) for v in z_vals_perm],
        y=[str(v) for v in y_vals],
        text=cm_perm,
        texttemplate="%{text}",
        showscale=False,
    ),
    row=1,
    col=2,
)

fig.update_layout(height=350, title="Permuting cluster IDs permutes contingency columns")
fig.show()
perm (old->new): [2 0 1]
inv_perm (new->old): [1 2 0]
original pred label values: [0 1 2]
permuted pred label values: [0 1 2]
cm_perm == cm_np[:, inv_perm] ? True

5) A 2D clustering example (synthetic data)#

We’ll create 3 Gaussian blobs with known labels (only for evaluation), run a simple NumPy k-means, and then visualize the resulting contingency matrix.

# Synthetic 2D dataset with 3 ground-truth classes

centers = np.array([
    [-2.0, -2.0],
    [0.0, 2.5],
    [2.5, -0.5],
])
std = 0.7
n_per_class = 180

X = np.vstack([rng.normal(loc=c, scale=std, size=(n_per_class, 2)) for c in centers])
y_true = np.repeat(np.arange(len(centers)), n_per_class)

# Shuffle so class blocks are mixed
perm_idx = rng.permutation(X.shape[0])
X = X[perm_idx]
y_true = y_true[perm_idx]

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=y_true.astype(str),
    title="Ground truth classes (for evaluation only)",
    labels={"x": "x1", "y": "x2", "color": "true class"},
)
fig.show()
def kmeans_single_run(X, k, *, n_iters=50, rng=None):
    """Basic k-means (Lloyd's algorithm) for Euclidean distance.

    Returns
    -------
    labels : ndarray of shape (n_samples,)
    centers : ndarray of shape (k, n_features)
    inertia : float
        Sum of squared distances to assigned centers.
    """
    if rng is None:
        rng = np.random.default_rng()

    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]

    # Init centers by sampling points
    init_idx = rng.choice(n_samples, size=k, replace=False)
    centers = X[init_idx].copy()

    labels = np.full(n_samples, -1, dtype=int)  # sentinel so the first convergence check cannot pass spuriously

    for _ in range(n_iters):
        # Assignment step
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)  # (n_samples, k)
        new_labels = np.argmin(d2, axis=1)

        # Update step
        new_centers = centers.copy()
        for j in range(k):
            mask = new_labels == j
            if not np.any(mask):
                # Empty cluster -> re-init to a random point
                new_centers[j] = X[rng.integers(0, n_samples)]
            else:
                new_centers[j] = X[mask].mean(axis=0)

        if np.array_equal(new_labels, labels):
            centers = new_centers
            labels = new_labels
            break

        centers = new_centers
        labels = new_labels

    inertia = float(np.sum((X - centers[labels]) ** 2))
    return labels, centers, inertia


def kmeans_np(X, k, *, n_iters=50, n_init=10, rng=None):
    """Run k-means with multiple random initializations and keep the best inertia."""
    if rng is None:
        rng = np.random.default_rng()

    best_labels = None
    best_centers = None
    best_inertia = np.inf

    for _ in range(n_init):
        labels, centers, inertia = kmeans_single_run(X, k, n_iters=n_iters, rng=rng)
        if inertia < best_inertia:
            best_inertia = inertia
            best_labels = labels
            best_centers = centers

    return best_labels, best_centers, float(best_inertia)
k = 3
labels_km, centers_km, inertia_km = kmeans_np(X, k, n_iters=60, n_init=8, rng=rng)

print("k-means inertia (SSE):", inertia_km)

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=labels_km.astype(str),
    title="Predicted k-means clusters (k=3)",
    labels={"x": "x1", "y": "x2", "color": "cluster"},
)
fig.add_trace(
    go.Scatter(
        x=centers_km[:, 0],
        y=centers_km[:, 1],
        mode="markers",
        name="centers",
        marker=dict(symbol="x", size=14, color="black"),
    )
)
fig.show()
k-means inertia (SSE): 511.10262206779885
cm, y_vals, z_vals = contingency_matrix_np(y_true, labels_km)

fig = px.imshow(
    cm,
    text_auto=True,
    aspect="auto",
    x=[str(v) for v in z_vals],
    y=[str(v) for v in y_vals],
    labels={"x": "cluster", "y": "true class", "color": "count"},
    title="Contingency matrix for k-means vs ground truth",
)
fig.update_layout(height=360)
fig.show()

# Also show column-normalized view: P(class | cluster)
col_sums = cm.sum(axis=0, keepdims=True)
cm_col_norm = np.divide(cm, col_sums, out=np.zeros(cm.shape, dtype=float), where=col_sums > 0)

fig = px.imshow(
    cm_col_norm,
    text_auto=".2f",
    aspect="auto",
    x=[str(v) for v in z_vals],
    y=[str(v) for v in y_vals],
    labels={"x": "cluster", "y": "true class", "color": "P(class | cluster)"},
    title="Column-normalized: how pure is each cluster?",
    zmin=0,
    zmax=1,
)
fig.update_layout(height=360)
fig.show()

6) Derived scores + label alignment#

The contingency matrix itself is not a single “score”, but you can derive many useful quantities.

Purity#

A simple external clustering score is purity:

\[ \mathrm{Purity}(y,z) = \frac{1}{n} \sum_{b=1}^K \max_{a \in \{1,\dots,C\}} N_{a,b} \]

It measures how dominated each cluster is by its majority class. Purity tends to increase with \(K\) (more clusters) and reaches 1.0 in the degenerate case where every point is its own cluster.
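As a quick worked example, applying the formula to the small 3×3 table from section 2 (column maxima 2, 2, and 2 out of \(n = 7\)):

```python
import numpy as np

# Toy table from section 2: rows = {cat, dog, fish}, columns = clusters {0, 1, 2}
N = np.array([[0, 2, 0],
              [2, 0, 1],
              [0, 0, 2]])

purity = N.max(axis=0).sum() / N.sum()  # (2 + 2 + 2) / 7
print(round(purity, 4))  # 0.8571
```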

“Best-mapped” accuracy#

Because cluster IDs are arbitrary, people sometimes map each cluster to a class and then compute an accuracy-like number.

  • Majority-vote mapping: map each cluster to its majority class.

  • One-to-one mapping (when \(K=C\)): choose a permutation that maximizes the total matches.

These are useful for interpretation and benchmarking, but they are not a loss you can optimize with gradients.

def purity_from_contingency(cm):
    cm = np.asarray(cm)
    return float(np.sum(np.max(cm, axis=0)) / np.sum(cm))


def majority_vote_mapping(cm, true_labels, pred_labels):
    """Map each predicted label to the true label that is most frequent in that cluster."""
    cm = np.asarray(cm)
    best_true_idx_per_cluster = np.argmax(cm, axis=0)
    mapping = {pred_labels[j]: true_labels[i] for j, i in enumerate(best_true_idx_per_cluster)}
    return mapping


purity = purity_from_contingency(cm)
mv_map = majority_vote_mapping(cm, y_vals, z_vals)
y_pred_mv = np.array([mv_map[z] for z in labels_km])
acc_mv = float(np.mean(y_pred_mv == y_true))

print("purity:", purity)
print("majority-vote mapping:", mv_map)
print("majority-vote mapped accuracy:", acc_mv)
purity: 0.9981481481481481
majority-vote mapping: {0: 1, 1: 2, 2: 0}
majority-vote mapped accuracy: 0.9981481481481481
# One-to-one label alignment via the assignment problem (Hungarian algorithm)
# Works best when K == C; otherwise it matches only min(C, K) pairs.

row_ind, col_ind = linear_sum_assignment(-cm)  # maximize sum of counts
hungarian_map = {z_vals[j]: y_vals[i] for i, j in zip(row_ind, col_ind)}
y_pred_h = np.array([hungarian_map[z] for z in labels_km])
acc_h = float(np.mean(y_pred_h == y_true))

print("Hungarian mapping:", hungarian_map)
print("Hungarian mapped accuracy:", acc_h)
Hungarian mapping: {2: 0, 0: 1, 1: 2}
Hungarian mapped accuracy: 0.9981481481481481

7) Using it for a simple optimization: choosing k for k-means (external criterion)#

The contingency matrix requires ground truth labels, so it is mainly used for:

  • benchmarking clustering algorithms on labeled datasets

  • semi-supervised settings (you do have labels)

  • debugging / sanity checks on synthetic data

A simple way to “optimize” with it is to pick hyperparameters that maximize a derived external score (here: purity).

We’ll compare:

  • Purity (external, uses y_true)

  • Inertia / SSE (internal k-means objective, does not use labels)

def purity_score_np(y_true, z_pred):
    cm, _, _ = contingency_matrix_np(y_true, z_pred)
    return purity_from_contingency(cm)


ks = list(range(2, 9))
purities = []
inertias = []

for k in ks:
    labels_k, centers_k, inertia_k = kmeans_np(X, k, n_iters=70, n_init=10, rng=rng)
    purities.append(purity_score_np(y_true, labels_k))
    inertias.append(inertia_k)

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(go.Scatter(x=ks, y=purities, mode="lines+markers", name="Purity"), secondary_y=False)
fig.add_trace(go.Scatter(x=ks, y=inertias, mode="lines+markers", name="Inertia (SSE)"), secondary_y=True)

fig.update_xaxes(title_text="k (#clusters)")
fig.update_yaxes(title_text="Purity (higher better)", range=[0, 1.05], secondary_y=False)
fig.update_yaxes(title_text="Inertia / SSE (lower better)", secondary_y=True)
fig.update_layout(title="Hyperparameter selection with contingency-derived purity")
fig.show()
# Multiple restarts: internal objective (inertia) vs external purity

k = 3
seeds = rng.integers(0, 1_000_000, size=40)

purity_list = []
inertia_list = []

for s in seeds:
    labels_s, centers_s, inertia_s = kmeans_single_run(X, k, n_iters=70, rng=np.random.default_rng(int(s)))
    inertia_list.append(inertia_s)
    purity_list.append(purity_score_np(y_true, labels_s))

fig = px.scatter(
    x=inertia_list,
    y=purity_list,
    title="Random restarts (k=3): inertia vs purity",
    labels={"x": "inertia / SSE", "y": "purity"},
)
fig.show()

8) Pros, cons, and when to use it#

Pros#

  • Interpretable: a direct count table of overlaps.

  • Fast: \(\mathcal{O}(n)\) time to build once labels are mapped to indices.

  • Foundational: many clustering metrics are functions of the contingency matrix.

  • Flexible: works with any hashable labels (ints, strings, …).

  • Sparse-friendly: can be represented efficiently when \(C\times K\) is huge but most pairs never occur.

Cons#

  • Not a scalar score: you usually need an additional reduction (purity, MI, ARI, …).

  • Requires reference labels (labels_true): not available in pure unsupervised clustering.

  • Easy to game with \(K\) (more clusters) if you turn it into purity/accuracy-like scores.

  • Comparisons need care: raw counts depend on dataset size and class imbalance.

Good uses#

  • Debugging clustering pipelines on labeled or synthetic datasets.

  • Understanding failure modes (which classes get mixed/split).

  • As an intermediate object to compute robust, permutation-invariant clustering scores.

9) Pitfalls + diagnostics#

  • Axis confusion: rows correspond to labels_true, columns to labels_pred.

  • Permutation invariance: cluster IDs are arbitrary; a “different looking” matrix may just be a column permutation.

  • Imbalance: a large class can dominate counts; inspect normalized views (\(P(\text{cluster}|\text{class})\) and \(P(\text{class}|\text{cluster})\)).

  • Different numbers of classes/clusters: one-to-one alignment is not well-defined when \(K\neq C\).

  • Using labels for tuning: choosing hyperparameters by maximizing a contingency-derived score leaks supervision.

  • eps changes the table: only use it for numerical reasons in downstream log-based metrics.

10) Exercises#

  1. Implement a dense contingency matrix builder using only np.bincount (hint: flatten 2D indices).

  2. For the synthetic dataset, compare purity to a permutation-invariant score like ARI or NMI.

  3. Show how purity behaves as you let \(k\) grow all the way to the number of points.

  4. Implement a one-to-one alignment using brute-force permutations for \(K\leq 7\) and compare to the Hungarian solution.

References#

  • scikit-learn docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cluster.contingency_matrix.html